graph LR
A["Traditional Benchmarks<br/>(MMLU, GPQA, etc.)<br/>90%+ accuracy"] --> B["Benchmark<br/>Saturation"]
B --> C["Humanity's Last Exam<br/>2,500 expert questions<br/>Best models < 45%"]
C --> D["Meaningful signal<br/>for frontier AI"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
Humanity’s Last Exam (HLE)
The hardest AI benchmark ever built: 2,500 expert-level questions designed to be the final closed-ended academic exam for AI
Keywords: Humanity’s Last Exam, HLE, AI benchmark, frontier LLM evaluation, CAIS, Scale AI, expert-level questions, calibration error, MMLU saturation, multi-modal benchmark, LLM leaderboard

Introduction
AI benchmarks are critical for measuring LLM progress — but most of them are already saturated. Frontier models now score over 90% on popular benchmarks like MMLU and GPQA, making them ineffective at distinguishing between state-of-the-art models.
Humanity’s Last Exam (HLE) was created to address this. It is a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind. Where other benchmarks have become routine for frontier LLMs, HLE remains brutally difficult — with even the best models scoring well below 50%.
“HLE tests structured academic problems rather than open-ended research or creative problem-solving abilities, making it a focused measure of technical knowledge and reasoning. HLE may be the last academic exam we need to give to models, but it is far from the last benchmark for AI.” — HLE Paper
What Is Humanity’s Last Exam?
Humanity’s Last Exam (HLE) is a multi-modal benchmark consisting of 2,500 questions across dozens of academic subjects — mathematics, humanities, natural sciences, and more. It was designed to test both:
- Depth of reasoning — world-class mathematical and scientific problems
- Breadth of knowledge — questions spanning over 100 subject areas
Key Characteristics
| Feature | Details |
|---|---|
| Total questions | 2,500 (public) + private held-out set |
| Subjects covered | 100+ across math, humanities, natural sciences |
| Question types | Multiple-choice (24%) and short-answer (76%) |
| Multi-modal | 14% of questions require understanding images/diagrams |
| Grading | Automated (closed-form, unambiguous answers) |
| Anti-contamination | Private test set to detect overfitting; canary strings (see the sketch below) |
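The canary string mentioned in the last row follows the common convention of embedding a unique marker in the dataset files so that model developers can check whether the benchmark has leaked into their training corpus. Below is a minimal sketch of that check; the marker value is a made-up placeholder, not the real HLE canary.

```python
# Minimal canary-string check. The value below is a made-up placeholder,
# not the actual HLE canary string.
CANARY = "HLE-CANARY-PLACEHOLDER-0000"

def corpus_contains_canary(corpus_path: str) -> bool:
    """Return True if the training corpus file contains the canary marker."""
    with open(corpus_path, "r", encoding="utf-8", errors="ignore") as f:
        return any(CANARY in line for line in f)
```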
What Makes It So Hard?
Every question in HLE was:
- Created by subject-matter experts — nearly 1,000 contributors across 500+ institutions in 50+ countries (professors, researchers, PhD holders)
- Required to stump frontier LLMs — a question only passed the initial bar if models could not answer it correctly
- Manually reviewed by expert reviewers with graduate degrees in relevant fields
- Verified unsearchable — questions that could be easily answered via web search were removed
The dataset started with over 70,000 submissions. Only 13,000 passed the LLM difficulty filter; expert human review then accepted roughly 2,700, and removing searchable or otherwise flagged questions left the finalized set of 2,500 public questions. (A minimal sketch of the difficulty filter follows the diagram below.)
graph TD
A["70,000+ submissions<br/>from global experts"] --> B["13,000 passed<br/>LLM difficulty filter"]
B --> C["Expert human review<br/>(graduate-level reviewers)"]
C --> D["2,700 accepted"]
D --> E["Remove searchable<br/>& flagged questions"]
E --> F["2,500 finalized<br/>public questions"]
style A fill:#ecf0f1,color:#333,stroke:#bdc3c7
style B fill:#f39c12,color:#fff,stroke:#333
style C fill:#e67e22,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style E fill:#8e44ad,color:#fff,stroke:#333
style F fill:#27ae60,color:#fff,stroke:#333
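The LLM difficulty filter in the first stage of this funnel can be pictured as follows. This is a minimal, hypothetical sketch rather than the authors' actual pipeline; the model names and the `query_model` / `answers_match` helpers are placeholders.

```python
# Hypothetical sketch of the "must stump frontier models" filter.
FRONTIER_MODELS = ["model-a", "model-b", "model-c"]  # placeholder names

def query_model(model_name: str, question: str) -> str:
    """Placeholder: send the question to the named model and return its answer."""
    raise NotImplementedError

def answers_match(predicted: str, ground_truth: str) -> bool:
    """Placeholder comparison; the real pipeline may use exact match or a judge model."""
    return predicted.strip().lower() == ground_truth.strip().lower()

def passes_difficulty_filter(question: str, ground_truth: str) -> bool:
    """A submission advances only if none of the frontier models answers it correctly."""
    for model in FRONTIER_MODELS:
        if answers_match(query_model(model, question), ground_truth):
            return False  # at least one model solved it, so reject
    return True  # every model failed, so the question moves on to expert review
```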
Who Built It?
HLE was developed by the Center for AI Safety (CAIS) and Scale AI, with lead authors:
- Long Phan, Nathaniel Li, Adam Khoja, Richard Ren — Center for AI Safety
- Alice Gatti, Ziwen Han, Josephina Hu, Hugh Zhang — Scale AI
- Summer Yue, Alexandr Wang — Scale AI (senior leads)
- Dan Hendrycks — Center for AI Safety (senior lead)
Contributors competed for a $500,000 USD prize pool ($5,000 for each of the top 50 questions, $500 for the next 500 questions), along with optional co-authorship.
Publication
HLE was published in Nature (vol. 649, pp. 1139–1146, January 2026), one of the most prestigious scientific journals, underscoring its significance to the research community.
| Resource | Link |
|---|---|
| Nature paper | nature.com/articles/s41586-025-09962-4 |
| arXiv preprint | arxiv.org/abs/2501.14249 |
| Website | lastexam.ai |
| GitHub | github.com/centerforaisafety/hle |
What Skills Does It Test?
Unlike narrowly focused benchmarks, HLE tests a broad spectrum of expert-level academic capabilities:
graph TD
HLE["Humanity's Last Exam<br/>2,500 questions"] --> M["Mathematics<br/>& Logic"]
HLE --> S["Natural Sciences<br/>Physics, Chemistry, Biology"]
HLE --> H["Humanities<br/>History, Classics, Philosophy"]
HLE --> CS["Computer Science<br/>& Engineering"]
HLE --> Med["Medicine<br/>& Life Sciences"]
HLE --> Other["Other Disciplines<br/>Economics, Law, Linguistics..."]
style HLE fill:#e74c3c,color:#fff,stroke:#333
style M fill:#3498db,color:#fff,stroke:#333
style S fill:#27ae60,color:#fff,stroke:#333
style H fill:#f39c12,color:#fff,stroke:#333
style CS fill:#8e44ad,color:#fff,stroke:#333
style Med fill:#e67e22,color:#fff,stroke:#333
style Other fill:#6cc3d5,color:#fff,stroke:#333
| Capability | What HLE Tests |
|---|---|
| Deep reasoning | Multi-step mathematical proofs, complex derivations |
| Expert knowledge | Cutting-edge scientific facts, obscure domain knowledge |
| Multi-modal understanding | Questions with diagrams, inscriptions, chemical structures |
| Calibration | Whether models know what they don’t know (confidence estimation) |
| Resistance to search | Knowledge that cannot be trivially retrieved via internet search |
Example Questions
HLE questions span extraordinary breadth — from translating Palmyrene script on Roman tombstones (Classics) to identifying the number of paired tendons supported by a hummingbird’s sesamoid bone (Ecology/Anatomy). This diversity is what makes HLE uniquely challenging.
Current Leaderboard
The leaderboard below shows model accuracy on HLE as published on the SEAL LLM Leaderboard by Scale AI. Rankings use Rank (Upper Bound): a model's rank is 1 plus the number of models whose lower confidence-interval bound exceeds its upper bound, so differences in rank reflect statistically meaningful gaps. (A short sketch of this rule follows the table.)
Source: SEAL LLM Leaderboard — Humanity’s Last Exam (consulted March 28, 2026). Dataset updated April 3, 2025, with finalized 2,500 questions. Judge model: o3-mini.
| Rank | Model | Accuracy (%) | Calibration Error |
|---|---|---|---|
| 1 | GPT-5.4 Pro | 44.32 ± 1.95 | 38 |
| 2 | Gemini 3 Pro Preview | 37.52 ± 1.90 | 57 |
| 2 | GPT-5.4 (xhigh thinking) | 36.24 ± 1.88 | 42 |
| 2 | Claude Opus 4.6 (thinking max) | 34.44 ± 1.86 | 46 |
| 4 | GPT-5 Pro | 31.64 ± 1.82 | 49 |
| 6 | GPT-5.2 | 27.80 ± 1.76 | 45 |
| 6 | GPT-5 | 25.32 ± 1.70 | 50 |
| 6 | Claude Opus 4.5 (thinking) | 25.20 ± 1.70 | 55 |
| 6 | Kimi K2.5 | 24.37 ± 1.81 | 67 |
| 7 | GPT-5.1 (thinking) | 23.68 ± 1.67 | 55 |
| 9 | Gemini 2.5 Pro (Jun 05) | 21.64 ± 1.61 | 72 |
| 11 | o3 (high) | 20.32 ± 1.58 | 34 |
| 11 | GPT-5 Mini | 19.44 ± 1.55 | 65 |
| 11 | o3 (medium) | 19.20 ± 1.54 | 39 |
| 11 | Claude Opus 4.6 (non-thinking) | 19.00 ± 1.54 | 44 |
Key takeaway: Even the best frontier model (GPT-5.4 Pro) scores only 44.32% — meaning more than half the questions remain unsolved. Most models exhibit high calibration errors, indicating systematic overconfidence.
For the full, up-to-date leaderboard, visit the links in the next section.
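To make the Rank (Upper Bound) rule concrete, here is a minimal sketch that recomputes it from accuracy values and confidence-interval half-widths. The three entries are taken from the table above purely for illustration; the SEAL leaderboard's exact interval construction may differ.

```python
# Rank (Upper Bound): 1 + the number of models whose lower CI bound
# exceeds this model's upper CI bound.
models = {
    # name: (accuracy %, CI half-width)
    "GPT-5.4 Pro": (44.32, 1.95),
    "Gemini 3 Pro Preview": (37.52, 1.90),
    "GPT-5.4 (xhigh thinking)": (36.24, 1.88),
}

def rank_upper_bound(name: str) -> int:
    acc, ci = models[name]
    upper = acc + ci
    strictly_better = sum(
        1 for other, (o_acc, o_ci) in models.items()
        if other != name and (o_acc - o_ci) > upper
    )
    return 1 + strictly_better

for name in models:
    print(f"{name}: rank {rank_upper_bound(name)}")
# GPT-5.4 Pro: rank 1; the other two tie at rank 2, matching the table above.
```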
Where to Explore the Benchmark
Dashboards and Leaderboards
| Resource | Description | Link |
|---|---|---|
| SEAL LLM Leaderboard | Scale AI’s official leaderboard with confidence intervals and calibration | labs.scale.com/leaderboard/humanitys_last_exam |
| CAIS AI Dashboard | Center for AI Safety’s dashboard with HLE-Rolling live submission | agi.safe.ai/dashboard |
| HLE Website | Official website with paper, results, and progress chart | lastexam.ai |
Dataset and Code
| Resource | Description | Link |
|---|---|---|
| Hugging Face Dataset | The full 2,500-question dataset (requires access agreement) | huggingface.co/datasets/cais/hle |
| GitHub Repository | Evaluation code, prompts, and documentation | github.com/centerforaisafety/hle |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2501.14249 |
| Nature Publication | Peer-reviewed publication | nature.com/articles/s41586-025-09962-4 |
Load the Dataset
from datasets import load_dataset

# Load the public HLE test split (requires accepting the access agreement on Hugging Face)
dataset = load_dataset("cais/hle", split="test")
HLE-Rolling
In October 2025, the team released HLE-Rolling — a dynamic, evolving fork of the benchmark that accepts new contributions over time. This ensures HLE remains relevant as models improve.
Understanding the Metrics
Accuracy
The primary metric. Models answer each question, and an automated judge (o3-mini) compares each response against the ground-truth answer. Because answers are closed-form and unambiguous, grading can be fully automated rather than relying on subjective human scoring.
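A minimal sketch of that grading loop is below. The `judge` helper is a placeholder for the o3-mini judging call; the actual judging prompt ships with the GitHub repository.

```python
def judge(question: str, response: str, ground_truth: str) -> bool:
    """Placeholder for the o3-mini judge call that decides correct vs. incorrect."""
    raise NotImplementedError

def hle_accuracy(examples: list[dict], responses: list[str]) -> float:
    """examples: dicts with 'question' and 'answer' keys; responses: model outputs."""
    correct = sum(
        judge(ex["question"], resp, ex["answer"])
        for ex, resp in zip(examples, responses)
    )
    return 100.0 * correct / len(examples)
```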
Calibration Error
Models are prompted to provide both an answer and a confidence score (0–100%). Calibration error measures the gap between stated confidence and actual accuracy.
| Scenario | Confidence | Accuracy | Calibration |
|---|---|---|---|
| Well-calibrated | 50% | 50% | Good |
| Overconfident | 85% | 10% | Bad (CE: 75+) |
| Current frontier models | 60–90% | 5–45% | Bad (CE: 34–89) |
Key insight: Most frontier models are systematically overconfident on HLE — they express high confidence even when wrong. This is strong evidence of confabulation/hallucination. The o3 model family shows the best calibration (CE: 34–39), while older models like GPT-4o exhibit calibration errors of 89.
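One common way to estimate a calibration error like this is to bin predictions by stated confidence and average the gap between confidence and observed accuracy within each bin. The sketch below uses that generic binned formulation; HLE's exact formula (binning scheme, mean vs. RMS aggregation) follows the paper and may differ.

```python
import numpy as np

def calibration_error(confidences, corrects, n_bins: int = 10) -> float:
    """Binned calibration error in percentage points.

    confidences: stated confidence per question, in [0, 100]
    corrects: 1 if the judge marked the answer correct, else 0
    Note: a generic ECE-style estimate, not necessarily HLE's exact formula.
    """
    conf = np.asarray(confidences, dtype=float) / 100.0
    corr = np.asarray(corrects, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ce = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if mask.any():
            gap = abs(conf[mask].mean() - corr[mask].mean())
            ce += (mask.sum() / conf.size) * gap
    return 100.0 * ce

# A model that claims 85% confidence but is right only 10% of the time
# lands around CE = 75, like the "Overconfident" row in the table above.
```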
Why HLE Matters
graph LR
A["Benchmark<br/>Saturation"] --> B["Cannot distinguish<br/>frontier models"]
B --> C["HLE fills the gap"]
C --> D["Informed AI policy<br/>& research"]
A2["Overconfident<br/>models"] --> B2["Calibration errors<br/>not flagged"]
B2 --> C
C --> D2["Better safety<br/>assessments"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
- Measures what matters — Expert-level academic reasoning, not just pattern matching
- Resists saturation — Even the best models score < 50%
- Exposes overconfidence — Calibration metrics reveal when models are hallucinating
- Informs policy — Provides a common reference point for scientists and policymakers
- Anti-contamination — Private held-out set detects overfitting to the public dataset
Conclusion
Humanity’s Last Exam represents a milestone in AI evaluation:
- 2,500 expert-crafted questions across 100+ subjects that frontier LLMs still largely cannot solve
- Built by ~1,000 subject-matter experts from 500+ institutions across 50+ countries
- Published in Nature — peer-reviewed and validated by the scientific community
- The best model scores 44% — vast room for improvement remains
- Calibration errors reveal that models don’t know what they don’t know
As AI capabilities advance, HLE provides a meaningful yardstick for measuring genuine progress — not just incremental improvements on already-saturated benchmarks. When models eventually achieve high accuracy on HLE, it will signal a profound leap in AI’s ability to match expert human knowledge on closed-ended academic questions.
But as the authors note: “HLE may be the last academic exam we need to give to models, but it is far from the last benchmark for AI.”
References
- Phan, L., Gatti, A., Han, Z., Li, N. et al. “A benchmark of expert-level academic questions to assess AI capabilities.” Nature 649, 1139–1146 (2026). doi:10.1038/s41586-025-09962-4
- Phan, L., Gatti, A., Han, Z., Li, N. et al. “Humanity’s Last Exam.” arXiv preprint arXiv:2501.14249 (2025). arxiv.org/abs/2501.14249
- Center for AI Safety & Scale AI. “Humanity’s Last Exam — Official Website.” lastexam.ai
- Scale AI. “SEAL LLM Leaderboard — Humanity’s Last Exam.” labs.scale.com/leaderboard/humanitys_last_exam (consulted March 28, 2026)
- Center for AI Safety. “HLE Dataset.” Hugging Face. huggingface.co/datasets/cais/hle
- Center for AI Safety. “HLE GitHub Repository.” github.com/centerforaisafety/hle
Read More
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- Scale your evaluation infrastructure — see Scaling LLM Serving for Enterprise Production
- Understand quantization trade-offs for evaluation — see Quantization Methods for LLMs
- HLE Official Website
- SEAL LLM Leaderboard
- HLE Dataset on Hugging Face
- HLE GitHub Repository